Fix duplicated IDs in ALTO XML when multiple pages are present #4386
base: main
Thanks for the fix. I just wonder whether we should try to keep the current IDs for the most common use case where only a single page is processed. This could be achieved for example by
…nt pge
This will ensure that valid ALTO XML output is generated while keeping the IDs for the first page the same as before.
Thanks for your suggestion @stweil! I added a function.
Disclosure: I haven't written CPP before, so pls check for rookie mistakes :)
Pull Request Overview
This PR addresses the issue of duplicated IDs in ALTO XML output when processing multiple pages by incorporating the page number into the generated IDs.
- Introduces a new helper function (GetID) to generate IDs based on a prefix, page number, and counter.
- Updates ID assignments for elements such as Illustration, GraphicalElement, ComposedBlock, TextBlock, TextLine, and String to ensure uniqueness.
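A helper along these lines could look roughly like the sketch below. It is not the code from this PR: the `_` separator and the zero-based `page_number` are assumptions made for illustration, and the first-page branch follows the suggestion in the conversation to keep single-page IDs unchanged.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical sketch of an ID helper, not the merged implementation.
// Assumptions: page_number is zero-based and "_" separates the parts.
static std::string GetID(const char *prefix, int page_number, int counter) {
  assert(prefix != nullptr);
  std::stringstream id;
  id << prefix << '_';
  // The first page keeps the counter-only form so that single-page output
  // does not change; later pages also carry the page number.
  if (page_number > 0) {
    id << page_number << '_';
  }
  id << counter;
  return id.str();
}
```

Under these assumptions, `GetID("block", 0, 3)` and `GetID("block", 2, 3)` produce different strings, so the same counter value on two pages no longer yields the same ID.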
Comments suppressed due to low confidence (1)
src/api/altorenderer.cpp:54
- [nitpick] Consider renaming GetID to 'GenerateUniqueID' to more clearly reflect its role in creating unique IDs across pages.
static std::string GetID(char const * prefix, int page_number, int counter) {
@@ -51,6 +51,20 @@ static void AddBoxToAlto(const ResultIterator *it, PageIteratorLevel level,
}
}

static std::string GetID(char const * prefix, int page_number, int counter) {
- static std::string GetID(char const * prefix, int page_number, int counter) {
+ static std::string GetID(const char *prefix, int page_number, int counter) {
The original code is technically fine, but my suggested change uses the coding style which is typical for Tesseract code.
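As a small illustration of why both spellings are equivalent, the sketch below checks at compile time that `char const *` and `const char *` name the same type. The function `Describe` is a hypothetical name used only for this example.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Both spellings name the same type: pointer to const char.
static_assert(std::is_same<char const *, const char *>::value,
              "the two spellings are the same type");

// Because the types are identical, the declaration and the definition below
// refer to one and the same function despite the different spelling.
std::string Describe(char const *prefix);

std::string Describe(const char *prefix) {
  assert(prefix != nullptr);
  return std::string(prefix) + "_0";
}
```

The choice between the two is therefore purely a matter of house style, with no effect on the generated code.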
When running tesseract on a list of images and validating the alto output using the schema (https://www.loc.gov/alto/v3/alto-3-0.xsd), I had the following error:
Digging into the type of BlockTypeID, I found an xsd:ID restriction, meaning that the BlockTypeID can only exist once in the whole document, i.e. it must be unique per document.
I incorporated the current page_number into the IDs, thus resolving the issue.
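The collision can be reproduced without any XML tooling. The sketch below is illustrative only: it assumes the element counter restarts on every page, uses a made-up `block_` prefix, and the helper names `FindDuplicateIds` and `MakeIds` are hypothetical. It builds the IDs for several pages and reports any value that appears more than once, which is what an xsd:ID check would reject.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Returns every ID that has already been seen earlier in the list.
std::vector<std::string> FindDuplicateIds(const std::vector<std::string> &ids) {
  std::unordered_set<std::string> seen;
  std::vector<std::string> duplicates;
  for (const std::string &id : ids) {
    if (!seen.insert(id).second) {
      duplicates.push_back(id);
    }
  }
  return duplicates;
}

// Builds IDs for `pages` pages with `blocks_per_page` blocks each.
// Assumption: the counter restarts at 0 on every page.
std::vector<std::string> MakeIds(int pages, int blocks_per_page,
                                 bool include_page) {
  assert(pages >= 0 && blocks_per_page >= 0);
  std::vector<std::string> ids;
  for (int page = 0; page < pages; ++page) {
    for (int counter = 0; counter < blocks_per_page; ++counter) {
      std::string id = "block_";
      if (include_page) {
        id += std::to_string(page) + "_";
      }
      id += std::to_string(counter);
      ids.push_back(id);
    }
  }
  return ids;
}
```

With two pages of two blocks each, `MakeIds(2, 2, false)` repeats both IDs on the second page, while `MakeIds(2, 2, true)` yields four distinct values.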